{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c25470fb-da7a-4bab-b1fa-c5e54c6ab96a",
   "metadata": {
    "tags": []
   },
   "source": [
    "# L2: Data Exploration for Tuning a Foundation Model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9eee94e3-0a12-4616-8a5b-cd0c78f88982",
   "metadata": {
    "tags": []
   },
   "source": [
    "**Project environment setup:**\n",
    "\n",
    "- Load credentials and relevant Python Libraries\n",
    "- If you were running this notebook locally, you would first install Vertex AI. In this classroom, this is already installed.\n",
    "\n",
    "```\n",
    "!pip install google-cloud-aiplatform\n",
    "```\n",
    "- You can download the `requirements.txt` for this course from the workspace of this lab. `File --> Open...`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8d58769-1fea-4c2e-8ce1-86a8489a6e41",
   "metadata": {
    "height": 46,
    "tags": []
   },
   "outputs": [],
   "source": [
    "from utils import authenticate\n",
    "credentials, PROJECT_ID = authenticate() "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f53c0073-9723-40cd-8205-61a50729f5f6",
   "metadata": {
    "height": 29,
    "tags": []
   },
   "outputs": [],
   "source": [
    "REGION = \"us-central1\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9985efad-4a9e-4fd0-b35e-569d17d8cf05",
   "metadata": {
    "tags": []
   },
   "source": [
    "- Import the [Vertex AI](https://cloud.google.com/vertex-ai) SDK.\n",
    "- The library helps to interact with the Vertex AI services in the cloud.\n",
    "- Initialize it. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cfbb69b4-7401-4f4e-9067-86e6f5684a03",
   "metadata": {
    "height": 29,
    "tags": []
   },
   "outputs": [],
   "source": [
    "import vertexai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7802ff08-ac5a-4252-a464-237ab2f86f5a",
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "vertexai.init(project = PROJECT_ID,\n",
    "              location = REGION,\n",
    "              credentials = credentials)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20ec9443-b52d-4157-8b03-44774561c375",
   "metadata": {
    "tags": []
   },
   "source": [
    "- Import [BigQuery](https://cloud.google.com/bigquery) to use as your data warehouse.\n",
    "- Initialize the client to start interacting with the data warehouse, send SQL and retrieve data into the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e48d8307-09a5-4518-89ea-6d46cd41353f",
   "metadata": {
    "height": 29,
    "tags": []
   },
   "outputs": [],
   "source": [
    "from google.cloud import bigquery"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a9a99c48-ce21-4628-9bf6-ec16aeea9e51",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "bq_client = bigquery.Client(project=PROJECT_ID,\n",
    "                            credentials = credentials)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33cf86c4-e2e5-416f-ab57-04f53cbfa703",
   "metadata": {},
   "source": [
    "## Stack Overflow Public Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44d8c159-5b14-40f1-a657-fdd451a82ddf",
   "metadata": {},
   "source": [
    "- You will use [Stack Overflow Data](https://cloud.google.com/blog/topics/public-datasets/google-bigquery-public-datasets-now-include-stack-overflow-q-a) on BigQuery Public Datasets.\n",
    "- The datasets include questions, answers and metadata related to Stack Overflow questions. Within this dataset, there are tables with data.\n",
    "- Create a SQL query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27a3e57a-8bea-44bb-891d-951d33997be9",
   "metadata": {
    "height": 114
   },
   "outputs": [],
   "source": [
    "QUERY_TABLES = \"\"\"\n",
    "SELECT\n",
    "  table_name\n",
    "FROM\n",
    "  `bigquery-public-data.stackoverflow.INFORMATION_SCHEMA.TABLES`\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0597e2d5-959e-475e-8434-b9da91c84739",
   "metadata": {},
   "source": [
    "- The query is asking to retrieve `table_name` of all the `TABLES`\n",
    "- Use the client to send your SQL and retrieve the data (tables names)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ca80b70a-c989-4448-b9bf-bafbadef0d17",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "query_job = bq_client.query(QUERY_TABLES)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b3bccd0-7f04-4864-901b-402e505f341b",
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "for row in query_job:\n",
    "    for value in row.values():\n",
    "        print(value)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "861de62b-a94d-4b59-98fc-f7f027b0b13c",
   "metadata": {},
   "source": [
    "## Data Retrieval"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4111b8f-c9ab-4b47-9a85-3430650c375c",
   "metadata": {},
   "source": [
    "- You'll fetch some data from the data warehouse and store it in Pandas dataframe for visualization.\n",
    "- Select all columns from  `posts_questions` and put the `LIMIT` as 3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ffec7eb-7ba4-472e-85ba-39fce6ab706f",
   "metadata": {
    "height": 131
   },
   "outputs": [],
   "source": [
    "INSPECT_QUERY = \"\"\"\n",
    "SELECT\n",
    "    *\n",
    "FROM\n",
    "    `bigquery-public-data.stackoverflow.posts_questions`\n",
    "LIMIT 3\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ccbe4aca-8d5b-4e3f-8edd-41e32fca49bf",
   "metadata": {
    "height": 29,
    "tags": []
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "94d974a6-d922-46ca-89fb-c29abbed0128",
   "metadata": {
    "height": 29,
    "tags": []
   },
   "outputs": [],
   "source": [
    "query_job = bq_client.query(INSPECT_QUERY)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d6bd733b-e0d9-452c-a2d9-ad4072e51c78",
   "metadata": {
    "tags": []
   },
   "source": [
    "- Take the results of the query `-->` create an arrow table (which is part of [Apache Framework](https://arrow.apache.org/docs/index.html)) `-->` which goes into a Pandas dataframe.\n",
    "- This allows for data to be in a format which is easier to read and explore with Pandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "491d400f-a9a4-45bf-92f7-666e2ca6bda4",
   "metadata": {
    "height": 97,
    "tags": []
   },
   "outputs": [],
   "source": [
    "stack_overflow_df = query_job\\\n",
    "    .result()\\\n",
    "    .to_arrow()\\\n",
    "    .to_pandas()\n",
    "stack_overflow_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59d1fac8-8e5e-4986-ae30-9e374ff5078b",
   "metadata": {},
   "source": [
    "### Dealing with Large Datasets\n",
    "\n",
    "- Large datasets for LLMs often don't fit into memory.\n",
    "- Select all of the columns and rows of the table `posts_questions`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "863e9ce6-206b-4032-8eb8-9e6204d8b158",
   "metadata": {
    "height": 114,
    "tags": []
   },
   "outputs": [],
   "source": [
    "QUERY_ALL = \"\"\"\n",
    "SELECT\n",
    "    *\n",
    "FROM\n",
    "    `bigquery-public-data.stackoverflow.posts_questions` q\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "632b3e00-5014-48f3-860c-895ed7b1e263",
   "metadata": {
    "height": 29,
    "tags": []
   },
   "outputs": [],
   "source": [
    "query_job = bq_client.query(QUERY_ALL)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ca55b26f-1e9c-46f2-933b-0d2e85269948",
   "metadata": {
    "height": 131,
    "tags": []
   },
   "outputs": [],
   "source": [
    "try:\n",
    "    stack_overflow_df = query_job\\\n",
    "    .result()\\\n",
    "    .to_arrow()\\\n",
    "    .to_pandas()\n",
    "except Exception as e:\n",
    "    print('The DataFrame is too large to load into memory.', e)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9999ff41-e115-4491-84c3-9d7f04fafd85",
   "metadata": {
    "tags": []
   },
   "source": [
    "**Note:** The data is too large to return, as it is not fitting into memory."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d406553f-c356-4dc8-a119-7d8f02787d4b",
   "metadata": {},
   "source": [
    "#### Joining Tables and Query Optimization\n",
    "\n",
    "- When working with (large) data, query optimizing is needed in order to save time and resources.\n",
    "- Select questions as `input_text` (column 1), answers as `output_text` (column 2).\n",
    "- Take the questions from `posts_questions` and answers from `posts_answers`.\n",
    "- Join the questions and their corresponding accepted answers based on their same `unique ID`.\n",
    "- Making sure the question is about `Python`, and that it `has an answer`. And the date the question was posted is on or after `2020-01-01`\n",
    "- Limit as 10,000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "71dc0eb0-ca3d-4d29-ae7e-f3b40cab9184",
   "metadata": {
    "height": 301,
    "tags": []
   },
   "outputs": [],
   "source": [
    "QUERY = \"\"\"\n",
    "SELECT\n",
    "    CONCAT(q.title, q.body) as input_text,\n",
    "    a.body AS output_text\n",
    "FROM\n",
    "    `bigquery-public-data.stackoverflow.posts_questions` q\n",
    "JOIN\n",
    "    `bigquery-public-data.stackoverflow.posts_answers` a\n",
    "ON\n",
    "    q.accepted_answer_id = a.id\n",
    "WHERE\n",
    "    q.accepted_answer_id IS NOT NULL AND\n",
    "    REGEXP_CONTAINS(q.tags, \"python\") AND\n",
    "    a.creation_date >= \"2020-01-01\"\n",
    "LIMIT\n",
    "    10000\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8affe03b-6be6-4be0-ad09-703f47dc31b9",
   "metadata": {
    "height": 29,
    "tags": []
   },
   "outputs": [],
   "source": [
    "query_job = bq_client.query(QUERY)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d23c0df2-2365-4a49-9dee-f49c800312dd",
   "metadata": {
    "height": 114,
    "tags": []
   },
   "outputs": [],
   "source": [
    "### this may take some seconds to run\n",
    "stack_overflow_df = query_job.result()\\\n",
    "                        .to_arrow()\\\n",
    "                        .to_pandas()\n",
    "\n",
    "stack_overflow_df.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05bdc676-a15d-4d28-a95c-7b451c6427b7",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Adding Instructions\n",
    "\n",
    "- Instructions for LLMs have been shown to improve\n",
    "model performance and generalization to unseen tasks [(Google, 2022)](https://arxiv.org/pdf/2210.11416.pdf).\n",
    "- Wihtout the instruction, it is only question and answer. Model might not understand what to do.\n",
    "- With the instructions, the model gets a guideline as to what task to perform."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66208ed7-7325-4613-8b57-fedfa9c85aef",
   "metadata": {
    "height": 131,
    "tags": []
   },
   "outputs": [],
   "source": [
    "INSTRUCTION_TEMPLATE = f\"\"\"\\\n",
    "Please answer the following Stackoverflow question on Python. \\\n",
    "Answer it like you are a developer answering Stackoverflow questions.\n",
    "\n",
    "Stackoverflow question:\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "645069aa-7e45-4fa3-91e4-665b9f7c17a4",
   "metadata": {
    "tags": []
   },
   "source": [
    "- A new column will combine `INSTRUCTION_TEMPLATE` and the question `input_text`.\n",
    "- This avoids overwritting of any existing column which might be needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed30a78b-6b4a-4b07-b66f-b5147e6ec062",
   "metadata": {
    "height": 63,
    "tags": []
   },
   "outputs": [],
   "source": [
    "stack_overflow_df['input_text_instruct'] = INSTRUCTION_TEMPLATE + ' '\\\n",
    "    + stack_overflow_df['input_text']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "007f22f9-cb41-4035-842e-bd6b27c01bf4",
   "metadata": {},
   "source": [
    "### Dataset for Tuning\n",
    "\n",
    "- Divide the data into a training and evaluation. By default, 80/20 split is used.\n",
    "- This (80/20 split) allows for more data to be used for tuning. The evaluation split is used as unseen data during tuning to evaluate performance.\n",
    "- The `random_state` parameter is used to ensure random sampling for a fair comparison. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7dc2940f-735a-4fe0-b579-b8e63d161d57",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6c97888-7eb8-49f0-b3e7-2a417d931b49",
   "metadata": {
    "height": 131
   },
   "outputs": [],
   "source": [
    "train, evaluation = train_test_split(\n",
    "    stack_overflow_df,\n",
    "    ### test_size=0.2 means 20% for evaluation\n",
    "    ### which then makes train set to be of 80%\n",
    "    test_size=0.2,\n",
    "    random_state=42\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1cb841d8-e71a-464c-8cb3-3c712a16dc6a",
   "metadata": {},
   "source": [
    "#### Different Datasets and Flow\n",
    "\n",
    "- Versioning data is important.\n",
    "- It allows for reproducibility, traceability, and maintainability of machine learning models.\n",
    "- Get the timestamp."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f0b0f876-10bf-4dbf-ae0f-1f90d62cb72c",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "import datetime"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf919020-de23-4e02-895f-3cf22fe75f91",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "date = datetime.datetime.now().strftime(\"%H:%d:%m:%Y\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5989a1ea-1bdb-4e43-a4c1-fec76812c26d",
   "metadata": {},
   "source": [
    "- Generate a `jsonl` file.\n",
    "- Name it as `tune_data_stack_overflow_python_qa-{date}`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7427ff3f-6f3a-4053-982c-3aa1fcfb656a",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "cols = ['input_text_instruct','output_text']\n",
    "tune_jsonl = train[cols].to_json(orient=\"records\", lines=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6727f808-a216-45e3-8073-4545e23c796d",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "training_data_filename = f\"tune_data_stack_overflow_\\\n",
    "                            python_qa-{date}.jsonl\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f639359-f873-404f-9fd7-bf88db8c52a0",
   "metadata": {
    "height": 46,
    "tags": []
   },
   "outputs": [],
   "source": [
    "with open(training_data_filename, \"w\") as f:\n",
    "    f.write(tune_jsonl)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44daae92-2502-4913-a53f-7dee952797fe",
   "metadata": {},
   "source": [
    "## Try it Yourself! - Evaluation Set\n",
    "\n",
    "The code above generted a `jsonl` file for the `train` set. Now, its time for you to make the `evaluation` set, which you can name as `tune_eval_data_stack_overflow_python_qa-{date}.jsonl`. The code for that is also provided to you in the drop down below, but we encourage you to try it yourself first before you look at it.\n",
    "\n",
    "<details>\n",
    "  <summary><font size=\"2\" color=\"darkgreen\"><b>Code for Evaluation Set (Click to expand)</b></font></summary>\n",
    "    \n",
    "```python\n",
    "\n",
    "cols = ['input_text_instruct','output_text']\n",
    "### you need to use the \"evaluation\" set now\n",
    "tune_jsonl = evaluation[cols].to_json(orient=\"records\", lines=True)\n",
    "\n",
    "### change the file name\n",
    "### use \"tune_eval_data_stack_overflow_python_qa-{date}.jsonl\"\n",
    "evaluation_data_filename = f\"tune_eval_data_stack_overflow_\\\n",
    "                            python_qa-{date}.jsonl\"\n",
    "\n",
    "### write the file\n",
    "with open(evaluation_data_filename, \"w\") as f:\n",
    "    f.write(tune_jsonl)\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92813bc1-1a68-473d-be14-d5ee480faedc",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "738b9e97-5f4f-49ae-ba3f-d130dbb6674b",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "environment": {
   "kernel": "python3",
   "name": "tf2-cpu.2-11.m113",
   "type": "gcloud",
   "uri": "gcr.io/deeplearning-platform-release/tf2-cpu.2-11:m113"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
